Abstract. Given a set of points $\mathcal{F}$ in a high dimensional space, the
cost of finding a union of subspaces $\bigcup_i V_i \subset \mathbb{R}^N$ that best explains
the data $\mathcal{F}$ increases dramatically with the dimension of $\mathbb{R}^N$. In this
article, we study a class of transformations that map the problem into 
another one in lower dimension. We use the best model in the low 
dimensional space to approximate the best solution in the original high 
dimensional space. We then estimate the error produced between this 
solution and the optimal solution in the high dimensional space. 

1. Introduction 

Given a set of vectors (points) $\mathcal{F} = \{f_1, \dots, f_m\}$ in a Hilbert space $\mathcal{H}$
(finite or infinite dimensional), the problem of finding a union of subspaces
$\bigcup_i V_i \subset \mathcal{H}$ that best explains the data $\mathcal{F}$ has applications to mathematics
and engineering. The subspaces $V_i$ allowed
in the model are often constrained. For example the subspaces $V_i$ may be
constrained to belong to a family of closed subspaces $\mathcal{C}$. A typical example
for $\mathcal{H} = \mathbb{R}^N$ is when $\mathcal{C}$ is the set of subspaces of dimension $k \ll N$. If $\mathcal{C}$
satisfies the so-called Minimum Subspace Approximation Property (MSAP),
an optimal solution to the non-linear problem of finding subspaces that best
fit the data exists, and algorithms to find these subspaces were developed [I]. 
Necessary and sufficient conditions for C to satisfy the MSAP are obtained 
in [5]. 

In some applications the model is a finite union of subspaces and $\mathcal{H}$ is finite
dimensional. Once the model is found, the given data points can be clustered 
and classified according to their distances from the subspaces, giving rise to 
the so called subspace clustering problem (see e.g., [9] and the references 
therein). Thus a dual problem is to first find a "best partition" of the data. 
Once this partition is obtained, the associated optimal subspaces can be 
easily found. In any case, the search for an optimal partition or optimal
subspaces usually involves heavy computations whose cost increases dramatically
with the dimensionality of $\mathcal{H}$. Thus one important feature is to map the
data into a lower dimensional space, and solve the transformed problem in 
this lower dimensional space. If the mapping is chosen appropriately, the 
original problem can be solved exactly or approximately using the solution 
of the transformed data. 

In this article, we concentrate on the non-linear subspace modeling problem
when the model is a finite union of subspaces of $\mathbb{R}^N$ of dimension
$k \ll N$. Our goal is to find transformations from a high dimensional space
to lower dimensional spaces with the aim of solving the subspace modeling 
problem using the low dimensional transformed data. We find the optimal 
data partition for the transformed data and use this partition for the origi- 
nal data to obtain the subspace model associated to this partition. We then 
estimate the error between the model thus found and the optimal subspaces 
model for the original data. 

2. Preliminaries 

Since one of our goals is to model a set of data by a union of subspaces, 
we first provide a measure of how well a given set of data can be modeled 
by a union of subspaces. 

We will assume in this article that the data belongs to the finite dimensional
space $\mathbb{R}^N$. There is no loss of generality in doing that, since it is
easy to see that the subspaces of any optimal solution belong to the span
of the data, which is a finite dimensional subspace of our (possibly infinite
dimensional) Hilbert space (see [3], Lemma 4.2). So we can assume that
the initial Hilbert space is the span of the data. 

Definition 2.1. Given a set of vectors $\mathcal{F} = \{f_1, \dots, f_m\}$ in $\mathbb{R}^N$, a real
number $\rho \ge 0$ and positive integers $l$, $k < N$, we will say that the data $\mathcal{F}$ is
$(l, k, \rho)$-sparse if there exist subspaces $V_1, \dots, V_l$ of $\mathbb{R}^N$ with $\dim(V_i) \le k$ for
$i = 1, \dots, l$, such that

$$e(\mathcal{F}, \{V_1, \dots, V_l\}) = \sum_{i=1}^{m} \min_{1 \le j \le l} d^2(f_i, V_j) \le \rho,$$

where $d$ stands for the Euclidean distance in $\mathbb{R}^N$.
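The error functional of Definition 2.1 can be evaluated directly. The following is a minimal sketch, assuming NumPy, data points stored as rows, and each subspace given by a matrix whose columns form a basis of full column rank; the names `dist2_to_subspace` and `bundle_error` are illustrative and do not come from the paper.

```python
import numpy as np

def dist2_to_subspace(f, basis):
    """Squared Euclidean distance from f to the span of the columns of `basis`."""
    q, _ = np.linalg.qr(basis)          # orthonormal basis of the span
    residual = f - q @ (q.T @ f)        # component of f orthogonal to the span
    return float(residual @ residual)

def bundle_error(points, bundle):
    """e(F, {V_1, ..., V_l}) = sum_i min_j d^2(f_i, V_j)."""
    return sum(min(dist2_to_subspace(f, v) for v in bundle) for f in points)

# Two lines in R^3 (l = 2, k = 1) and three data points, one slightly off the first line.
V1 = np.array([[1.0], [0.0], [0.0]])
V2 = np.array([[0.0], [1.0], [0.0]])
F = np.array([[2.0, 0.0, 0.0],
              [0.0, -3.0, 0.0],
              [1.0, 0.0, 0.1]])

rho = bundle_error(F, [V1, V2])   # only the third point contributes: 0.1**2
```

In this toy configuration the data is $(2, 1, \rho)$-sparse for every $\rho \ge 0.01$.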

When $\mathcal{F}$ is $(l, k, 0)$-sparse, we will simply say that $\mathcal{F}$ is $(l, k)$-sparse.

Note that if $\mathcal{F}$ is $(l, k)$-sparse, there exist $l$ subspaces $V_1, \dots, V_l$ of dimension
at most $k$, such that

$$\mathcal{F} \subset \bigcup_{i=1}^{l} V_i.$$

For the general case $\rho > 0$, the $(l, k, \rho)$-sparsity of the data implies that $\mathcal{F}$
can be partitioned into a small number of subsets, in such a way that each
subset belongs to or is at no more than $\rho$-distance from a low dimensional
subspace. The collection of these subspaces provides an optimal non-linear
sparse model for the data. 

Observe that if the data $\mathcal{F}$ is $(l, k, \rho)$-sparse, a model which verifies
Definition 2.1 provides a dictionary of length not bigger than $lk$ (and in most
cases much smaller) in which our data can be represented using at most $k$
atoms with an error smaller than $\rho$.

More precisely, let $\{V_1, \dots, V_l\}$ be a collection of subspaces which satisfies
Definition 2.1 and $D$ a set of vectors from $\bigcup_j V_j$ that is minimal with the
property that its span contains $\bigcup_j V_j$. Then for each $f \in \mathcal{F}$ there exists
$\Lambda \subset D$ with $\#\Lambda \le k$ such that

$$\Big\| f - \sum_{g \in \Lambda} \alpha_g g \Big\|_2^2 \le \rho, \quad \text{for some scalars } \alpha_g.$$

In [3] the authors studied the problem of finding, for each given
pair $(l, k)$, the minimum $\rho$-sparsity value of the data. They also provided
an algorithm for finding the optimal value of $\rho$, as well as the optimal
subspaces associated with $\rho$ and the corresponding optimal partition of the
data. Specifically, denote by $\mathcal{B}$ the collection of bundles of subspaces of $\mathbb{R}^N$,

$$\mathcal{B} = \{B = \{V_1, \dots, V_l\} : \dim(V_i) \le k, \ i = 1, \dots, l\},$$

and for $\mathcal{F} = \{f_1, \dots, f_m\}$ a finite subset of $\mathbb{R}^N$, define

$$e_0(\mathcal{F}) := \inf\{e(\mathcal{F}, B) : B \in \mathcal{B}\}. \qquad (1)$$

As a special case of a general theorem in [4] we obtain the next theorem. 

Theorem 2.2. Let $\mathcal{F} = \{f_1, \dots, f_m\}$ be vectors in $\mathbb{R}^N$, and let $l$ and $k$ be
given ($l < m$, $k < N$), then there exists a bundle $B_0 = \{V_1^0, \dots, V_l^0\} \in \mathcal{B}$
such that

$$e(\mathcal{F}, B_0) = e_0(\mathcal{F}) = \inf\{e(\mathcal{F}, B) : B \in \mathcal{B}\}. \qquad (2)$$

Any bundle $B_0 \in \mathcal{B}$ satisfying (2) will be called an optimal bundle for $\mathcal{F}$.

The following relations between partitions of the indices {1, . . . ,m} and 
bundles will be relevant for our analysis. 

We will denote by $\Pi_l(\{1, \dots, m\})$ the set of all $l$-sequences $S = \{S_1, \dots, S_l\}$
of subsets of $\{1, \dots, m\}$ satisfying the property that for all $1 \le i, j \le l$,

$$\bigcup_{r=1}^{l} S_r = \{1, \dots, m\} \quad \text{and} \quad S_i \cap S_j = \emptyset \ \text{for } i \ne j.$$

We want to emphasize that this definition does not exclude the case when
some of the $S_i$ are the empty set. By abuse of notation, we will still call the
elements of $\Pi_l(\{1, \dots, m\})$ partitions of $\{1, \dots, m\}$.

Definition 2.3. Given a bundle $B = \{V_1, \dots, V_l\} \in \mathcal{B}$, we can split the set
$\{1, \dots, m\}$ into a partition $S = \{S_1, \dots, S_l\} \in \Pi_l(\{1, \dots, m\})$ with respect
to that bundle, by grouping together into $S_i$ the indices of the vectors in $\mathcal{F}$
that are closer to a given subspace $V_i$ than to any other subspace $V_j$, $j \ne i$.
Thus, the partitions generated by $B$ are defined by $S = \{S_1, \dots, S_l\} \in
\Pi_l(\{1, \dots, m\})$, where

$$j \in S_i \ \text{if and only if} \ d(f_j, V_i) \le d(f_j, V_h), \quad \forall h = 1, \dots, l.$$
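A partition generated by a bundle, in the sense of Definition 2.3, can be computed by assigning each index to a nearest subspace. The sketch below assumes NumPy, points stored as rows and subspaces given by full-column-rank basis matrices. Ties are broken in favor of the smallest subspace index, which is one arbitrary choice among the partitions the definition allows; the function names are illustrative.

```python
import numpy as np

def dist2(f, basis):
    """Squared distance from f to the column span of `basis`."""
    q, _ = np.linalg.qr(basis)
    residual = f - q @ (q.T @ f)
    return float(residual @ residual)

def partition_from_bundle(points, bundle):
    """Return [S_1, ..., S_l]: S_i holds the indices j whose nearest subspace is V_i.

    np.argmin returns the first minimizer, so ties go to the smallest i.
    Some S_i may be empty, as the definition of Pi_l permits.
    """
    parts = [[] for _ in bundle]
    for j, f in enumerate(points):
        distances = [dist2(f, v) for v in bundle]
        parts[int(np.argmin(distances))].append(j)
    return parts

V1 = np.array([[1.0], [0.0], [0.0]])
V2 = np.array([[0.0], [1.0], [0.0]])
F = np.array([[2.0, 0.0, 0.0],
              [0.0, -3.0, 0.0],
              [1.0, 0.0, 0.1]])

S = partition_from_bundle(F, [V1, V2])
```

The indices here are 0-based, whereas the text indexes the data from 1.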

We can also associate to a given partition $S \in \Pi_l$ the bundles in $\mathcal{B}$ as
follows:

Definition 2.4. Given a partition $S = \{S_1, \dots, S_l\} \in \Pi_l$, a bundle $B =
\{V_1, \dots, V_l\} \in \mathcal{B}$ is generated by $S$ if and only if for every $i = 1, \dots, l$,

$$\sum_{j \in S_i} d^2(f_j, V_i) \le \sum_{j \in S_i} d^2(f_j, W) \quad \text{for all subspaces } W \text{ such that } \dim(W) \le k.$$
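A bundle generated by a partition, as in Definition 2.4, consists of a least-squares best subspace of dimension at most $k$ for each group of points. One standard way to obtain such a subspace is the singular value decomposition: the span of the leading $k$ left singular vectors minimizes the sum of squared distances. The sketch below assumes NumPy and points stored as rows; the function names and the rank tolerance `1e-12` are assumptions made for illustration.

```python
import numpy as np

def best_subspace(points, k):
    """Orthonormal basis (columns) of a least-squares best subspace of dim <= k."""
    n_dim = points.shape[1]
    if len(points) == 0:
        return np.zeros((n_dim, 0))          # empty group: the zero subspace
    u, s, _ = np.linalg.svd(points.T, full_matrices=False)
    rank = int(np.sum(s > 1e-12))            # numerical rank of the group
    return u[:, :min(k, rank)]

def bundle_from_partition(points, partition, k):
    """One best subspace per group S_i of the partition."""
    return [best_subspace(points[np.asarray(idx, dtype=int)], k)
            for idx in partition]

F = np.array([[2.0, 0.0, 0.0],
              [0.0, -3.0, 0.0],
              [4.0, 0.0, 0.0]])
bundle = bundle_from_partition(F, [[0, 2], [1], []], k=1)
```

When a singular value is repeated the best subspace is not unique, which matches the remark that a partition may generate several bundles.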

In this way, for a given data set $\mathcal{F}$, every bundle has a set of associated
partitions (those that are generated by the bundle) and every partition has
a set of associated bundles (those that are generated by the partition). Note,
however, that the fact that $S$ is generated by $B$ does not imply that $B$ is
generated by $S$, and vice versa. If $B_0$ is an optimal bundle that
solves the problem for the data $\mathcal{F}$ as in Theorem 2.2, then the
partition $S_0$ generated by $B_0$ also generates $B_0$. On the other hand, not
every pair $(B, S)$ with this property produces the minimal error $e_0(\mathcal{F})$.

Here and subsequently, the partition $S_0$ generated by the optimal bundle
$B_0$ will be called an optimal partition for $\mathcal{F}$.

If $M$ is a set of data and $V$ is a subspace of $\mathbb{R}^N$, we will denote by $E(M, V)$
the mean square error of the data $M$ to the subspace $V$, i.e.

$$E(M, V) = \sum_{f \in M} d^2(f, V). \qquad (3)$$

3. Main results 

The problem of finding the optimal union of subspaces that best models
a given set of data $\mathcal{F}$ is computationally expensive when the dimension $N$
of the ambient space is large. When the dimension $k$ of the subspaces is
considerably smaller than $N$, it is natural to map the data onto a lower-
dimensional subspace, solve an associated problem in the lower dimensional
space and map the solution back into the original space. Specifically, given
the data set $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ which is $(l, k, \rho)$-sparse and a sampling
matrix $A \in \mathbb{R}^{r \times N}$, with $r \ll N$, find the optimal partition of the sampled
data $\mathcal{F}' := A(\mathcal{F}) = \{Af_1, \dots, Af_m\} \subset \mathbb{R}^r$, and use this partition to find an
approximate solution to the optimal model for $\mathcal{F}$.

3.1. Dimensionality reduction: The ideal case $\rho = 0$. In this section
we will assume that the data $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ is $(l, k)$-sparse, i.e.,
there exist $l$ subspaces of dimension at most $k$ such that $\mathcal{F}$ lies in the union of
these subspaces. For this ideal case, we will show that we can always recover
the optimal solution to the original problem from the optimal solution to
the problem in the low dimensional space as long as the low dimensional
space has dimension $r > k$.

We will begin with the proof that for any sampling matrix $A \in \mathbb{R}^{r \times N}$, the
measurements $\mathcal{F}' = A(\mathcal{F})$ are $(l, k)$-sparse in $\mathbb{R}^r$.

Lemma 3.1. Assume the data $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ is $(l, k)$-sparse and
let $A \in \mathbb{R}^{r \times N}$. Then $\mathcal{F}' := A(\mathcal{F}) = \{Af_1, \dots, Af_m\} \subset \mathbb{R}^r$ is $(l, k)$-sparse.

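Proof. Let $V_1^0, \dots, V_l^0$ be optimal subspaces for $\mathcal{F}$. Since

$$\dim(A(V_i^0)) \le \dim(V_i^0) \le k \quad \forall\, 1 \le i \le l,$$

and

$$\mathcal{F}' \subset \bigcup_{i=1}^{l} A(V_i^0),$$

it follows that $\mathcal{W} := \{A(V_1^0), \dots, A(V_l^0)\}$ is an optimal bundle for $\mathcal{F}'$ and
$e(\mathcal{F}', \mathcal{W}) = 0$.

□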

Let $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ be $(l, k)$-sparse and $A \in \mathbb{R}^{r \times N}$. By Lemma
3.1, $\mathcal{F}'$ is $(l, k)$-sparse. Thus, there exists an optimal partition $S = \{S_1, \dots, S_l\}$
for $\mathcal{F}'$ in $\Pi_l(\{1, \dots, m\})$, such that

$$\mathcal{F}' \subset \bigcup_{i=1}^{l} W_i,$$

where $W_i := \mathrm{span}\{Af_j\}_{j \in S_i}$ and $\dim(W_i) \le k$. Note that $\{W_1, \dots, W_l\}$ is
an optimal bundle for $\mathcal{F}'$.

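We can define the bundle $B_S = \{V_1, \dots, V_l\}$ by

$$V_i := \mathrm{span}\{f_j\}_{j \in S_i}, \quad \forall\, 1 \le i \le l. \qquad (4)$$

Since $S \in \Pi_l(\{1, \dots, m\})$, we have that

$$\mathcal{F} \subset \bigcup_{i=1}^{l} V_i.$$

Thus, the bundle $B_S$ will be optimal for $\mathcal{F}$ if $\dim(V_i) \le k$, $\forall\, 1 \le i \le l$. The
above discussion suggests the following definition: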

Definition 3.2. Let $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ be $(l, k)$-sparse. We will call
a matrix $A \in \mathbb{R}^{r \times N}$ admissible for $\mathcal{F}$ if for every optimal partition $S$ for $\mathcal{F}'$,
the bundle $B_S$ defined by (4) is optimal for $\mathcal{F}$.

The next proposition states that almost all $A \in \mathbb{R}^{r \times N}$ are admissible for
$\mathcal{F}$.

The Lebesgue measure of a set $E \subset \mathbb{R}^q$ will be denoted by $|E|$.

Proposition 3.3. Assume the data $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ is $(l, k)$-sparse
and let $r > k$. Then, almost all $A \in \mathbb{R}^{r \times N}$ are admissible for $\mathcal{F}$.
Proof. If a matrix $A \in \mathbb{R}^{r \times N}$ is not admissible, there exists an optimal
partition $S \in \Pi_l$ for $\mathcal{F}'$ such that the bundle $B_S = \{V_1, \dots, V_l\}$ is not
optimal for $\mathcal{F}$.

Let $\mathcal{D}_k$ be the set of all the subspaces $V$ in $\mathbb{R}^N$ of dimension bigger than
$k$, such that $V = \mathrm{span}\{f_j\}_{j \in S}$ with $S \subset \{1, \dots, m\}$.

Thus, we have that the set of all the matrices of $\mathbb{R}^{r \times N}$ which are not
admissible for $\mathcal{F}$ is contained in the set

$$\bigcup_{V \in \mathcal{D}_k} \{A \in \mathbb{R}^{r \times N} : \dim(A(V)) \le k\}.$$

Note that the set $\mathcal{D}_k$ is finite, since there are finitely many subsets of
$\{1, \dots, m\}$. Therefore, the proof of the proposition is complete by showing
that for a fixed subspace $V \subset \mathbb{R}^N$, such that $\dim(V) > k$, it is true that

$$|\{A \in \mathbb{R}^{r \times N} : \dim(A(V)) \le k\}| = 0. \qquad (5)$$

Let then $V$ be a subspace such that $\dim(V) = t > k$. Given $\{v_1, \dots, v_t\}$ a
basis for $V$, by abuse of notation, we continue to write $V$ for the matrix in
$\mathbb{R}^{N \times t}$ with vectors $v_i$ as columns. Thus, proving (5) is equivalent to proving
that

$$|\{A \in \mathbb{R}^{r \times N} : \mathrm{rank}(AV) \le k\}| = 0. \qquad (6)$$

As $\min\{r, t\} > k$, the set $\{A \in \mathbb{R}^{r \times N} : \mathrm{rank}(AV) \le k\}$ is included in

$$\{A \in \mathbb{R}^{r \times N} : \det(V^* A^* A V) = 0\}. \qquad (7)$$

Since $\det(V^* A^* A V)$ is a non-trivial polynomial in the $r \times N$ coefficients of
$A$, the set (7) has Lebesgue measure zero. Hence, (6) follows.

□
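The measure-zero statement can be illustrated numerically: a matrix $A$ drawn from a continuous distribution does not lower the dimension of a fixed subspace below $k + 1$. The sketch below assumes NumPy, a Gaussian $A$ (one convenient continuous distribution, not prescribed by the proposition), and the case $r \ge t$, where the Gram matrix $V^* A^* A V$ is a $t \times t$ matrix that can be nonsingular. A single random draw illustrates the claim but proves nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
N, t, k, r = 50, 3, 2, 3           # dim(V) = t > k and r > k (here r = t)

V = rng.standard_normal((N, t))    # columns: a basis of V (full rank almost surely)
A = rng.standard_normal((r, N))    # generic sampling matrix

rank_V = np.linalg.matrix_rank(V)
rank_AV = np.linalg.matrix_rank(A @ V)           # dimension of A(V)
det_gram = np.linalg.det((A @ V).T @ (A @ V))    # det(V* A* A V)
```

For this draw `rank_AV` exceeds `k`, so `A` lies outside the exceptional set in (6).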

3.2. Dimensionality reduction: The non-ideal case $\rho > 0$. Even if a
set of data is drawn from a union of subspaces, in practice it is often
corrupted by noise. Thus, in general $\rho > 0$, and our goal is to estimate the
error produced when we solve the associated problem in the lower dimensional
space and map the solution back into the original space.

Intuitively, if $A \in \mathbb{R}^{r \times N}$ is an arbitrary matrix, the set $\mathcal{F}' = A\mathcal{F}$ will
preserve the original sparsity only if the matrix $A$ does not change the geometry
of the data in an essential way. One can think that in the ideal case, since
the data is sparse, it actually lies in a union of low dimensional subspaces
(which is a very thin set in the ambient space).

However, when the data is not 0-sparse, but only $\rho$-sparse with $\rho > 0$, the
optimal subspaces plus the data do not lie in a thin set. This is the main
obstacle to obtaining a result analogous to the ideal case.

Far from having the result that for almost any matrix $A$ the geometry of
the data will be preserved, we have the Johnson-Lindenstrauss lemma, which
guarantees, for a given data set, the existence of one such matrix $A$.

In what follows, we will use random matrices to obtain positive results
for the $\rho > 0$ case.
Let $(\Omega, \Pr)$ be a probability measure space. Given $r, N \in \mathbb{N}$, a random
matrix $A_\omega \in \mathbb{R}^{r \times N}$ is a matrix with entries $(A_\omega)_{ij} = a_{ij}(\omega)$, where $\{a_{ij}\}$ are
independent and identically distributed random variables for every $1 \le i \le r$
and $1 \le j \le N$.

Definition 3.4. We say that a random matrix $A_\omega \in \mathbb{R}^{r \times N}$ satisfies the
concentration inequality if for every $0 < \varepsilon < 1$, there exists $c_0 = c_0(\varepsilon) > 0$
(independent of $r$, $N$) such that for any $x \in \mathbb{R}^N$,

$$\Pr\big((1 - \varepsilon)\|x\|_2^2 \le \|A_\omega x\|_2^2 \le (1 + \varepsilon)\|x\|_2^2\big) \ge 1 - 2e^{-r c_0}. \qquad (8)$$

Such matrices are easy to come by as the next proposition shows [T].

Proposition 3.5. Let $A_\omega \in \mathbb{R}^{r \times N}$ be a random matrix whose entries are
chosen independently from either $\mathcal{N}(0, \frac{1}{r})$ or $\{-\frac{1}{\sqrt{r}}, \frac{1}{\sqrt{r}}\}$ Bernoulli. Then $A_\omega$
satisfies (8) with $c_0(\varepsilon) = \frac{\varepsilon^2}{4} - \frac{\varepsilon^3}{6}$.

By using random matrices $A_\omega$ satisfying (8) to produce the lower dimensional
data set $\mathcal{F}'$, we will be able to recover with high probability an optimal
partition for $\mathcal{F}$ using the optimal partition of $\mathcal{F}'$.

Below we will state the main results of Section 3.2 and we will give their
proofs in Section 4.

Note that by Lemma 3.1, if $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ is $(l, k, 0)$-sparse,
then $A_\omega(\mathcal{F})$ is $(l, k, 0)$-sparse for all $\omega \in \Omega$. The following proposition is
a generalization of Lemma 3.1 to the case where $\mathcal{F}$ is $(l, k, \rho)$-sparse with
$\rho > 0$.

Proposition 3.6. Assume the data $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ is $(l, k, \rho)$-sparse
with $\rho > 0$. If $A_\omega \in \mathbb{R}^{r \times N}$ is a random matrix which satisfies (8),
then $A_\omega \mathcal{F}$ is $(l, k, (1 + \varepsilon)\rho)$-sparse with probability at least $1 - 2me^{-r c_0}$.

Hence if the data is mapped with a random matrix which satisfies the
concentration inequality, then with high probability, the sparsity of the
transformed data is close to the sparsity of the original data. Further, as the
following theorem shows, we obtain an estimation for the error between $\mathcal{F}$
and the bundle generated by the optimal partition for $\mathcal{F}' = A_\omega \mathcal{F}$.

Note that, given a constant $a > 0$, the scaled data $a\mathcal{F} = \{af_1, \dots, af_m\}$
satisfies that $e(a\mathcal{F}, B) = a^2 e(\mathcal{F}, B)$ for any bundle $B$. So, an optimal bundle
for $\mathcal{F}$ is optimal for $a\mathcal{F}$, and vice versa. Therefore, we can assume that the
data $\mathcal{F} = \{f_1, \dots, f_m\}$ is normalized, that is, the matrix $M \in \mathbb{R}^{N \times m}$ which
has the vectors $\{f_1, \dots, f_m\}$ as columns has unit Frobenius norm. Recall
that the Frobenius norm of a matrix $M \in \mathbb{R}^{N \times m}$ is defined by

$$\|M\|^2 := \sum_{i=1}^{N} \sum_{j=1}^{m} M_{ij}^2, \qquad (9)$$

where $M_{ij}$ are the coefficients of $M$.
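The concentration inequality (8) can be observed empirically for Gaussian matrices with entries of variance $1/r$. The sketch below assumes NumPy and uses the constant $c_0(\varepsilon) = \varepsilon^2/4 - \varepsilon^3/6$; the sizes, the seed and the number of trials are arbitrary choices made for illustration. It compares the observed frequency of the event in (8) with the guaranteed lower bound.

```python
import numpy as np

def c0(eps):
    """Concentration constant for Gaussian or Bernoulli entries."""
    return eps ** 2 / 4 - eps ** 3 / 6

rng = np.random.default_rng(1)
N, r, eps, trials = 500, 400, 0.5, 50

x = rng.standard_normal(N)                 # a fixed vector in R^N
hits = 0
for _ in range(trials):
    # Entries drawn from N(0, 1/r), i.e. standard deviation 1/sqrt(r).
    A = rng.normal(0.0, np.sqrt(1.0 / r), size=(r, N))
    ratio = np.sum((A @ x) ** 2) / np.sum(x ** 2)
    hits += int((1 - eps) <= ratio <= (1 + eps))

empirical = hits / trials                   # observed frequency of the event in (8)
guaranteed = 1 - 2 * np.exp(-r * c0(eps))   # lower bound from (8)
```

The inequality holds for each fixed $x$ separately; statements about all $m$ data points at once come from a union bound, which is where the factors of $m$ in the later probability estimates originate.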
Theorem 3.7. Let $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ be a normalized data set and
$0 < \varepsilon < 1$. Assume that $A_\omega \in \mathbb{R}^{r \times N}$ is a random matrix satisfying (8) and
$S_\omega$ is an optimal partition for $\mathcal{F}' = A_\omega \mathcal{F}$ in $\mathbb{R}^r$. If $B_\omega$ is a bundle generated
by the partition $S_\omega$ and the data $\mathcal{F}$ in $\mathbb{R}^N$ as in Definition 2.4, then with
probability exceeding $1 - (2m^2 + 4m)e^{-r c_0}$, we have

$$e(\mathcal{F}, B_\omega) \le (1 + \varepsilon)e_0(\mathcal{F}) + \varepsilon c_1, \qquad (10)$$

where $c_1 = (l(d - k))^{1/2}$ and $d = \mathrm{rank}(\mathcal{F})$.
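The scheme behind Theorem 3.7 can be sketched end to end on a toy data set: find an optimal partition of the projected data, apply it to the original data, and compare the resulting error with $e_0(\mathcal{F})$. The sketch below assumes NumPy and replaces the optimal-partition algorithm by a brute-force search over all $l^m$ labelings, which is only feasible for very small $m$. All names and sizes are illustrative assumptions.

```python
import itertools
import numpy as np

def k_min_error(cols, k):
    """Sum of squared singular values beyond the k largest (best rank-k residual)."""
    if cols.shape[1] == 0:
        return 0.0
    s = np.linalg.svd(cols, compute_uv=False)
    return float(np.sum(s[k:] ** 2))

def best_partition(M, l, k):
    """Brute force: minimize the sum of k-minimal errors over all l^m labelings."""
    best_cost, best_labels = np.inf, None
    for labels in itertools.product(range(l), repeat=M.shape[1]):
        labels = np.array(labels)
        cost = sum(k_min_error(M[:, labels == i], k) for i in range(l))
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels

rng = np.random.default_rng(2)
N, m, l, k, r, eps = 30, 8, 2, 1, 10, 0.5

# Columns near two lines in R^N plus small noise, scaled to unit Frobenius norm.
dirs = rng.standard_normal((N, l))
M = dirs[:, rng.integers(0, l, m)] * rng.uniform(1, 2, m)
M = M + 0.01 * rng.standard_normal((N, m))
M /= np.linalg.norm(M)

e0, _ = best_partition(M, l, k)                      # e_0(F)
A = rng.normal(0.0, np.sqrt(1.0 / r), (r, N))        # Gaussian sampling matrix
_, labels = best_partition(A @ M, l, k)              # optimal partition for F' = A F
# Error of the bundle generated by that partition on the original data.
e_lifted = sum(k_min_error(M[:, labels == i], k) for i in range(l))

d = np.linalg.matrix_rank(M)
bound = (1 + eps) * e0 + eps * np.sqrt(l * (d - k))  # right-hand side of (10)
```

With such a small example and $\varepsilon = 0.5$, the additive term $\varepsilon c_1$ already exceeds the total energy $\|M\|^2 = 1$, so the bound holds trivially here; it becomes informative only for small $\varepsilon$, which in turn requires a larger $r$.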

Finally, we can use this theorem to show that the set of matrices which
are $\eta$-admissible (see definition below) is large.

The following definition generalizes Definition 3.2 to the $\rho$-sparse setting,
with $\rho > 0$.

Definition 3.8. Assume $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ is $(l, k, \rho)$-sparse and let
$0 < \eta < 1$. We will say that a matrix $A \in \mathbb{R}^{r \times N}$ is $\eta$-admissible for $\mathcal{F}$ if for
any optimal partition $S$ for $\mathcal{F}' = A\mathcal{F}$ in $\mathbb{R}^r$, the bundle $B_S$ generated by $S$
and $\mathcal{F}$ in $\mathbb{R}^N$ satisfies

$$e(\mathcal{F}, B_S) \le \rho + \eta.$$

We have the following generalization of Proposition 3.3, which provides
an estimate on the size of the set of $\eta$-admissible matrices.

Corollary 3.9. Let $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ be a normalized data set and
$0 < \eta < 1$. Assume that $A_\omega \in \mathbb{R}^{r \times N}$ is a random matrix which satisfies
property (8) for $\varepsilon = \eta\,(1 + \sqrt{l(d - k)})^{-1}$. Then $A_\omega$ is $\eta$-admissible for $\mathcal{F}$
with probability at least $1 - (2m^2 + 4m)e^{-r c_0(\varepsilon)}$.

Proof. Using the fact that $e_0(\mathcal{F}) \le E(\mathcal{F}, \{0\}) = \|\mathcal{F}\|^2 = 1$, we conclude
from Theorem 3.7 that

$$\Pr\big(e(\mathcal{F}, B_\omega) \le e_0(\mathcal{F}) + \varepsilon(1 + c_1)\big) \ge 1 - c_2 e^{-r c_0(\varepsilon)}, \qquad (11)$$

where $c_1 = (l(d - k))^{1/2}$, $d = \mathrm{rank}(\mathcal{F})$, and $c_2 = 2m^2 + 4m$. That is,

$$\Pr\big(e(\mathcal{F}, B_\omega) \le e_0(\mathcal{F}) + \eta\big) \ge 1 - (2m^2 + 4m)e^{-r c_0(\varepsilon)}.$$

□

As a consequence of the previous corollary, we have a bound on the
dimension of the lower dimensional space to obtain a bundle which produces
an error at $\eta$-distance of the minimal error with high probability.

Now, using that $c_0(\varepsilon) \ge \frac{\varepsilon^2}{12}$ for random matrices with Gaussian or Bernoulli
entries (see Proposition 3.5), from Theorem 3.7 we obtain the following
corollary.

Corollary 3.10. Let $\eta, \delta \in (0, 1)$ be given. Assume that $A_\omega \in \mathbb{R}^{r \times N}$ is a
random matrix whose entries are as in Proposition 3.5.
Then for every $r$ satisfying

$$r \ge \frac{12\,(1 + \sqrt{l(d - k)})^2}{\eta^2} \ln\Big(\frac{2m^2 + 4m}{\delta}\Big),$$

with probability at least $1 - \delta$ we have that

$$e(\mathcal{F}, B_\omega) \le e_0(\mathcal{F}) + \eta.$$
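The bound of Corollary 3.10 is straightforward to evaluate. The sketch below uses only the Python standard library; the function name `min_dimension` and the sample parameter values are illustrative assumptions.

```python
import math

def min_dimension(eta, delta, m, l, d, k):
    """Smallest integer r with r >= 12 (1 + sqrt(l (d - k)))^2 / eta^2 * ln((2 m^2 + 4 m) / delta)."""
    c1 = math.sqrt(l * (d - k))
    bound = 12 * (1 + c1) ** 2 / eta ** 2 * math.log((2 * m ** 2 + 4 * m) / delta)
    return math.ceil(bound)

# Example: m = 1000 points, l = 3 subspaces of dimension k = 2, data of rank d = 20.
r = min_dimension(eta=0.5, delta=0.05, m=1000, l=3, d=20, k=2)
```

The bound does not involve the ambient dimension $N$: it grows logarithmically in $m$ and $1/\delta$ and quadratically in $1/\eta$, so the reduction is worthwhile only when $N$ is considerably larger than the value returned.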

We want to remark here that the results of subsection 3.2 are valid for
any probability distribution that satisfies the concentration inequality (8).
The bound on the error is still valid for $\rho = 0$. However in that case we were
able to obtain sharp results.



4. Proofs 

4.1. Background and supporting results. Before proving the results of 
the previous section we need several known theorems, lemmas, and propo- 
sitions below. 

Given $M \in \mathbb{R}^{m \times m}$ a Hermitian matrix, let $\lambda_1(M) \ge \lambda_2(M) \ge \dots \ge
\lambda_m(M)$ be its eigenvalues and $s_1(M) \ge s_2(M) \ge \dots \ge s_m(M) \ge 0$ be its
singular values.

Recall that the Frobenius norm defined in (9) satisfies that

$$\|M\|^2 = \sum_{1 \le i, j \le m} M_{ij}^2 = \sum_{i=1}^{m} s_i^2(M),$$

where $M_{ij}$ are the coefficients of $M$.

Given $x \in \mathbb{R}^N$, we write $\|x\|_2$ for the $\ell^2$ norm of $x$ in $\mathbb{R}^N$.

Theorem 4.1. [8, Theorem III.4.1]

Let $A, B \in \mathbb{R}^{m \times m}$ be Hermitian matrices. Then for any choice of indices
$1 \le i_1 < i_2 < \dots < i_k \le m$,

$$\sum_{j=1}^{k} \big(\lambda_{i_j}(A) - \lambda_{i_j}(B)\big) \le \sum_{j=1}^{k} \lambda_j(A - B).$$

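Corollary 4.2. Let $A, B \in \mathbb{R}^{m \times m}$ be Hermitian matrices. Assume $k$ and $d$
are two integers which satisfy $0 \le k \le d \le m$, then

$$\Big| \sum_{j=k+1}^{d} \big(\lambda_j(A) - \lambda_j(B)\big) \Big| \le (d - k)^{1/2}\, \|A - B\|.$$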



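Proof. Since $A - B$ is Hermitian, it follows that for each $1 \le j \le m$ there
exists $1 \le i_j \le m$ such that

$$|\lambda_j(A - B)| = s_{i_j}(A - B).$$

From this and Theorem 4.1 we have

$$\sum_{j=k+1}^{d} \big(\lambda_j(A) - \lambda_j(B)\big) \le \sum_{j=1}^{d-k} \lambda_j(A - B) \le \sum_{j=1}^{d-k} s_{i_j}(A - B)$$

$$\le \sum_{j=1}^{d-k} s_j(A - B) \le (d - k)^{1/2} \Big( \sum_{j=1}^{d-k} s_j^2(A - B) \Big)^{1/2}$$

$$\le (d - k)^{1/2}\, \|A - B\|.$$

Interchanging the roles of $A$ and $B$ gives the same bound for the negative of
the sum, and hence for its absolute value.

□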



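Remark 4.3. Note that the bound of the previous corollary is sharp. Indeed,
let $A \in \mathbb{R}^{m \times m}$ be the diagonal matrix with coefficients $a_{ii} = 2$ for
$1 \le i \le d$, and $a_{ii} = 0$ otherwise. Let $B \in \mathbb{R}^{m \times m}$ be the diagonal matrix
with coefficients $b_{ii} = 2$ for $1 \le i \le k$, $b_{ii} = 1$ for $k + 1 \le i \le d$, and $b_{ii} = 0$
otherwise. Thus,

$$\Big| \sum_{j=k+1}^{d} \big(\lambda_j(A) - \lambda_j(B)\big) \Big| = \Big| \sum_{j=k+1}^{d} (2 - 1) \Big| = d - k.$$

Further $\|A - B\| = (d - k)^{1/2}$, and therefore

$$\Big| \sum_{j=k+1}^{d} \big(\lambda_j(A) - \lambda_j(B)\big) \Big| = (d - k)^{1/2}\, \|A - B\|.$$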
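The equality case of Remark 4.3 can be checked numerically. The sketch below assumes NumPy and picks $m = 6$, $d = 5$, $k = 2$ as an arbitrary instance; `np.linalg.norm` applied to a matrix returns its Frobenius norm by default.

```python
import numpy as np

m, d, k = 6, 5, 2

a = np.zeros(m)
a[:d] = 2.0                     # a_ii = 2 for 1 <= i <= d
b = np.zeros(m)
b[:k] = 2.0                     # b_ii = 2 for 1 <= i <= k
b[k:d] = 1.0                    # b_ii = 1 for k + 1 <= i <= d
A, B = np.diag(a), np.diag(b)

# Eigenvalues in non-increasing order, as in the text.
lam_A = np.sort(np.linalg.eigvalsh(A))[::-1]
lam_B = np.sort(np.linalg.eigvalsh(B))[::-1]

lhs = abs(np.sum(lam_A[k:d] - lam_B[k:d]))       # |sum_{j=k+1}^d (lambda_j(A) - lambda_j(B))|
rhs = np.sqrt(d - k) * np.linalg.norm(A - B)     # (d - k)^{1/2} ||A - B||
```

Both sides equal $d - k$, so the inequality of Corollary 4.2 is attained.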

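Lemma 4.4. [7] Suppose that $A_\omega \in \mathbb{R}^{r \times N}$ is a random matrix which satisfies
(8) and $u, v \in \mathbb{R}^N$, then

$$|\langle u, v \rangle - \langle A_\omega u, A_\omega v \rangle| \le \varepsilon \|u\|_2 \|v\|_2,$$

with probability at least $1 - 4e^{-r c_0}$.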



The following proposition was proved in [16], but we include its proof for 
the sake of completeness. 

Proposition 4.5. Let $A_\omega \in \mathbb{R}^{r \times N}$ be a random matrix which satisfies (8)
and $M \in \mathbb{R}^{N \times m}$ be a matrix. Then, we have

$$\|M^* M - M^* A_\omega^* A_\omega M\| \le \varepsilon \|M\|^2,$$

with probability at least $1 - 2(m^2 + m)e^{-r c_0}$.

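Proof. Set $Y_{i,j}(\omega) = (M^* M - M^* A_\omega^* A_\omega M)_{i,j} = \langle f_i, f_j \rangle - \langle A_\omega f_i, A_\omega f_j \rangle$. By
Lemma 4.4, with probability at least $1 - 4e^{-r c_0}$ we have that

$$|Y_{i,j}(\omega)| \le \varepsilon \|f_i\|_2 \|f_j\|_2. \qquad (12)$$

Note that if (12) holds for all $1 \le i \le j \le m$, then

$$\|M^* M - M^* A_\omega^* A_\omega M\|^2 = \sum_{1 \le i, j \le m} Y_{i,j}(\omega)^2
\le \varepsilon^2 \sum_{1 \le i, j \le m} \|f_i\|_2^2 \|f_j\|_2^2 = \varepsilon^2 \|M\|^4.$$

Thus, by the union bound, we obtain

$$\Pr\big(\|M^* M - M^* A_\omega^* A_\omega M\| \le \varepsilon \|M\|^2\big)
\ge \Pr\big(|Y_{i,j}(\omega)| \le \varepsilon \|f_i\|_2 \|f_j\|_2 \ \ \forall\, 1 \le i \le j \le m\big)$$

$$\ge 1 - \sum_{1 \le i \le j \le m} 4e^{-r c_0} = 1 - 2(m^2 + m)e^{-r c_0}.$$

□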
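Proposition 4.5 says that a matrix satisfying the concentration inequality approximately preserves the Gram matrix of the data in Frobenius norm. The sketch below assumes NumPy, a Gaussian matrix with entries of variance $1/r$, and arbitrary sizes and seed; it checks the inequality for one random draw and is not a substitute for the probabilistic statement.

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, r, eps = 1000, 20, 400, 0.5

M = rng.standard_normal((N, m))                  # columns f_1, ..., f_m
A = rng.normal(0.0, np.sqrt(1.0 / r), (r, N))    # entries N(0, 1/r)

gram = M.T @ M                                   # M* M
gram_projected = (A @ M).T @ (A @ M)             # M* A* A M

lhs = np.linalg.norm(gram - gram_projected)      # Frobenius norm of the difference
rhs = eps * np.linalg.norm(M) ** 2               # eps * ||M||^2
```

Only the $m \times m$ Gram matrices are compared, which is why the probability estimate involves $m$ but not $N$.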



4.2. New results and proof of Theorem 3.7. Given $M \in \mathbb{R}^{N \times m}$ with
columns $\{f_1, \dots, f_m\}$ and a subspace $V \subset \mathbb{R}^N$, let $E(M, V)$ be as in (3),
that is

$$E(M, V) = \sum_{i=1}^{m} d^2(f_i, V).$$

We will denote the $k$-minimal error associated with $M$ by

$$E_k(M) := \min_{V : \dim(V) \le k} E(M, V).$$

Let $d := \mathrm{rank}(M)$. Eckart-Young's Theorem (see [17]) states that

$$E_k(M) = \sum_{j=k+1}^{d} \lambda_j(M^* M), \qquad (13)$$

where $\lambda_1(M^* M) \ge \dots \ge \lambda_d(M^* M) > 0$ are the positive eigenvalues of
$M^* M$.
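Formula (13) can be checked against a direct computation: projecting the columns of $M$ onto the span of its $k$ leading left singular vectors leaves a residual whose squared Frobenius norm equals the sum of the trailing eigenvalues of $M^* M$. The sketch below assumes NumPy and a random test matrix of arbitrary size.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m, k = 12, 7, 2
M = rng.standard_normal((N, m))

# Right-hand side of (13): eigenvalues of M* M in non-increasing order.
lam = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
d = np.linalg.matrix_rank(M)
tail = float(np.sum(lam[k:d]))

# Left-hand side: squared distances to the span of the k leading left singular vectors.
u, s, _ = np.linalg.svd(M, full_matrices=False)
Uk = u[:, :k]
residual = M - Uk @ (Uk.T @ M)
direct = float(np.sum(residual ** 2))
```

The two quantities agree up to rounding, since the eigenvalues of $M^* M$ are the squared singular values of $M$.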

Lemma 4.6. Assume that $M \in \mathbb{R}^{N \times m}$ and $A \in \mathbb{R}^{r \times N}$ are arbitrary
matrices. Let $S \in \mathbb{R}^{N \times s}$ be a submatrix of $M$. If $d := \mathrm{rank}(M)$ is such that
$0 \le k \le d$, then

$$|E_k(S) - E_k(AS)| \le (d - k)^{1/2}\, \|S^* S - S^* A^* A S\|.$$

Proof. Let $d_s := \mathrm{rank}(S)$. We have $\mathrm{rank}(AS) \le d_s$. If $d_s \le k$, the result is
trivial. Otherwise by (13) and Corollary 4.2 we obtain

$$|E_k(S) - E_k(AS)| = \Big| \sum_{j=k+1}^{d_s} \big(\lambda_j(S^* S) - \lambda_j(S^* A^* A S)\big) \Big|
\le (d_s - k)^{1/2}\, \|S^* S - S^* A^* A S\|.$$

As $S$ is a submatrix of $M$, we have that $(d_s - k)^{1/2} \le (d - k)^{1/2}$, which
proves the lemma.

□
Recall that $e_0(\mathcal{F})$ is the optimal value for the data $\mathcal{F}$, and $e_0(A_\omega \mathcal{F})$ is the
optimal value for the data $\mathcal{F}' = A_\omega \mathcal{F}$ (see (1)). A relation between these
two values is given by the following lemma.

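Lemma 4.7. Let $\mathcal{F} = \{f_1, \dots, f_m\} \subset \mathbb{R}^N$ and $0 < \varepsilon < 1$. If $A_\omega \in \mathbb{R}^{r \times N}$
is a random matrix which satisfies (8), then with probability exceeding
$1 - 2me^{-r c_0}$, we have

$$e_0(A_\omega \mathcal{F}) \le (1 + \varepsilon)\, e_0(\mathcal{F}).$$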

Proof. Let $V \subset \mathbb{R}^N$ be a subspace and $M \in \mathbb{R}^{N \times m}$ be a matrix. Using (8)
and the union bound, with probability at least $1 - 2me^{-r c_0}$ we have that

$$E(A_\omega M, A_\omega V) = \sum_{i=1}^{m} d^2(A_\omega f_i, A_\omega V) \le \sum_{i=1}^{m} \|A_\omega f_i - A_\omega (P_V f_i)\|_2^2$$

$$\le (1 + \varepsilon) \sum_{i=1}^{m} \|f_i - P_V f_i\|_2^2 = (1 + \varepsilon)\, E(M, V),$$

where $P_V$ is the orthogonal projection onto $V$.

Assume that $S = \{S_1, \dots, S_l\}$ is an optimal partition for $\mathcal{F}$ and $\{V_1, \dots, V_l\}$
is an optimal bundle for $\mathcal{F}$. Suppose that $m_i = \#(S_i)$ and $M_i \in \mathbb{R}^{N \times m_i}$ are
the matrices which have $\{f_j\}_{j \in S_i}$ as columns. From what has been proved
above and the union bound, with probability exceeding $1 - \sum_{i=1}^{l} 2m_i e^{-r c_0} =
1 - 2me^{-r c_0}$ it holds that

$$e_0(A_\omega \mathcal{F}) \le \sum_{i=1}^{l} E(A_\omega M_i, A_\omega V_i) \le (1 + \varepsilon) \sum_{i=1}^{l} E(M_i, V_i) = (1 + \varepsilon)\, e_0(\mathcal{F}).$$

□



Proof of Proposition 3.6. This is a direct consequence of Lemma 4.7. □



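Proof of Theorem 3.7. If $S_\omega = \{S_\omega^1, \dots, S_\omega^l\}$, and $m_\omega^i = \#(S_\omega^i)$, let $M_\omega^i \in
\mathbb{R}^{N \times m_\omega^i}$ be the matrices which have $\{f_j\}_{j \in S_\omega^i}$ as columns. Since $B_\omega =
\{V_\omega^1, \dots, V_\omega^l\}$ is generated by $S_\omega$ and $\mathcal{F}$, it follows that $E(M_\omega^i, V_\omega^i) = E_k(M_\omega^i)$.
And as $S_\omega$ is an optimal partition for $A_\omega \mathcal{F}$ in $\mathbb{R}^r$, we have that
$\sum_{i=1}^{l} E_k(A_\omega M_\omega^i) = e_0(A_\omega \mathcal{F})$.

Hence, using Lemma 4.6, Lemma 4.7, and Proposition 4.5, with high
probability it holds that

$$e(\mathcal{F}, B_\omega) \le \sum_{i=1}^{l} E(M_\omega^i, V_\omega^i) = \sum_{i=1}^{l} E_k(M_\omega^i)$$

$$\le \sum_{i=1}^{l} E_k(A_\omega M_\omega^i) + (d - k)^{1/2} \sum_{i=1}^{l} \|M_\omega^{i*} M_\omega^i - M_\omega^{i*} A_\omega^* A_\omega M_\omega^i\|$$

$$\le e_0(A_\omega \mathcal{F}) + (l(d - k))^{1/2} \Big( \sum_{i=1}^{l} \|M_\omega^{i*} M_\omega^i - M_\omega^{i*} A_\omega^* A_\omega M_\omega^i\|^2 \Big)^{1/2}$$

$$\le (1 + \varepsilon)\, e_0(\mathcal{F}) + (l(d - k))^{1/2}\, \|M^* M - M^* A_\omega^* A_\omega M\|$$

$$\le (1 + \varepsilon)\, e_0(\mathcal{F}) + \varepsilon\, (l(d - k))^{1/2},$$

where $M \in \mathbb{R}^{N \times m}$ is the unit Frobenius norm matrix which has the
vectors $\{f_1, \dots, f_m\}$ as columns.

The probability estimate for (10) follows from Proposition 4.5, Lemma 4.7, and the
fact that

$$\Pr\big(e(\mathcal{F}, B_\omega) \le (1 + \varepsilon)\, e_0(\mathcal{F}) + \varepsilon\, (l(d - k))^{1/2}\big)$$

$$\ge \Pr\big(\|M^* M - M^* A_\omega^* A_\omega M\| \le \varepsilon \ \text{and} \ e_0(A_\omega \mathcal{F}) \le (1 + \varepsilon)\, e_0(\mathcal{F})\big)$$

$$\ge 1 - \big(2(m^2 + m)e^{-r c_0} + 2me^{-r c_0}\big) = 1 - (2m^2 + 4m)e^{-r c_0}.$$

□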



5. Conclusions and related work 

The existence of optimal union of subspaces models and an algorithm for 
finding them was obtained in [5]. In the present paper we have focused on 
the computational complexity of finding these models. More precisely, we 
studied techniques of dimension reduction for the algorithm proposed in [3]. 
These techniques can also be used in a wide variety of situations and are 
not limited to this particular application. 

We used random linear transformations to map the data to a lower di- 
mensional space. The "projected" signals were then processed in that space, 
(i.e. finding the optimal union of subspaces) in order to produce an optimal 
partition. Then we applied this partition to the original data to obtain the 
associated model for that partition and obtained a bound for the error. 

We have analyzed two situations. First we studied the case when the 
data belongs to a union of subspaces (ideal case with no noise). In that 
case we obtained the optimal model using almost any transformation (see
Proposition 3.3).

In the presence of noise, the data usually doesn't belong to a union of 
low dimensional subspaces. Thus, the distances from the data to an optimal 
model add up to a positive error. In this case, we needed to restrict the 
admissible transformations. We applied recent results on distributions of
matrices satisfying concentration inequalities, which also proved to be very 
useful in the theory of compressed sensing. 

We were able to prove that the model obtained by our approach is quasi 
optimal with a high probability. That is, if we map the data using a random 
matrix from one of the distributions satisfying the concentration law, then 
with high probability, the distance of the data to the model is bounded by the 
optimal distance plus a constant. This constant depends on the parameter 
of the concentration law, and the parameters of the model (number and 
dimension of the subspaces allowed in the model) . 

Let us remark here that the problem of finding the optimal union of 
subspaces that fit a given data set is also known as "Projective clustering". 
Several algorithms have been proposed in the literature to solve this problem. 
Particularly relevant is [10] (see also references therein) where the authors 
used results from volume and adaptive sampling to obtain a polynomial-time 
approximation scheme. See [2] for a related algorithm. 